Statistics for Political Science
August 7, 2025
This class is designed to give you the skills to pursue your own independent research projects in the future; you do not need to master every technique perfectly.
“Correlation doesn’t equal causation”
…but all our statistical evidence about causation is built on correlations.
We need to understand the basics of statistical correlations:
We will introduce quite a lot of notation in the next few weeks. We have tried to keep it to a minimum, but notation is often necessary to communicate complex ideas quickly and precisely.
Notation today:
Quantitative political science seeks to understand political phenomena through numerical representation.
“Different states are debating when, if at all, abortion should be legal during a woman’s pregnancy. A normal pregnancy could go up to as many as 40 weeks. Until what point in a pregnancy do you think a woman should be legally allowed to obtain an abortion?”
Respondents choose a number of weeks (0–40).
Fundamental tension in quantitative analysis: detail vs parsimony.
Compare averages across categories:
Summarize relationships across multiple continuous variables:
Central tendency refers to measures that identify the center of a dataset.
Continuous and ordinal variables:
Categorical variables: report as frequency tables or recode as binary (0/1) indicators.
¹ This value is often meaningless for continuous variables, so is rarely included.
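For categorical variables, the two options (a frequency table or a binary recode) can be sketched in a few lines of Python. The party labels below are made up for illustration and are not from the survey; the variable names are also assumptions.

```python
from collections import Counter

# Hypothetical party identification for six made-up respondents.
party = ["Democrat", "Republican", "Independent",
         "Democrat", "Republican", "Democrat"]
n = len(party)

# Option 1: a frequency table (counts and shares per category).
counts = Counter(party)
shares = {category: count / n for category, count in counts.items()}

# Option 2: recode as a binary indicator (1 = Democrat, 0 = anyone else).
is_democrat = [1 if p == "Democrat" else 0 for p in party]

# The mean of a 0/1 indicator is the share of respondents coded 1.
share_democrat = sum(is_democrat) / n

for category, count in counts.most_common():
    print(f"{category:<12} {count}  ({shares[category]:.2f})")
print("Mean of binary indicator:", share_democrat)
```

One reason the binary recode is convenient: the mean of the indicator equals the corresponding share in the frequency table, so the tools for means apply directly.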
\[ \bar{Y} = \frac{\sum_{i=1}^n Y_i}{n} \]
Components:
1. Zero-Sum Property
\[ \sum_{i=1}^n (Y_i - \bar{Y}) = 0 \]
2. Least-Squares Property
\[
\sum_{i=1}^n (Y_i - \bar{Y})^2 \;<\;
\sum_{i=1}^n (Y_i - c)^2
\quad \forall\; c \neq \bar{Y}
\]
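Both properties are easy to check numerically. The sketch below uses six made-up responses (not survey data); the helper names `mean` and `sum_sq_dev` are ours.

```python
# Hypothetical responses: weeks (0-40) chosen by six made-up respondents.
weeks = [0, 12, 12, 20, 24, 40]


def mean(values):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(values) / len(values)


def sum_sq_dev(values, c):
    """Sum of squared deviations of the observations from a constant c."""
    return sum((y - c) ** 2 for y in values)


y_bar = mean(weeks)

# Zero-sum property: deviations from the mean cancel out.
total_deviation = sum(y - y_bar for y in weeks)

# Least-squares property: every c other than the mean gives a larger sum.
at_mean = sum_sq_dev(weeks, y_bar)
candidates = (0, 10, 17, 19, 25, 40)
at_other = [sum_sq_dev(weeks, c) for c in candidates]

print("mean:", y_bar)
print("sum of deviations:", total_deviation)
print("sum of squares at the mean:", at_mean)
for c, value in zip(candidates, at_other):
    print(f"sum of squares at c={c}:", value)
```

Checking a handful of candidate values of c is only an illustration, not a proof; the inequality holds for every c different from the mean.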
The median is the middle value of a variable when the observations are ordered from smallest to largest.
It divides the distribution into two halves: roughly 50% of values are at or below the median and roughly 50% are at or above it.
If there is an odd number of observations, the median is the middle value.
If there is an even number of observations, the median is the average of the two middle values.
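The odd/even rule translates directly into code. This is a minimal sketch with made-up values; in practice `statistics.median` from the Python standard library does the same job.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle
    values when the number of observations is even."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: a single middle observation exists.
        return ordered[mid]
    # Even n: average the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median([24, 0, 12, 40, 20])        # sorted: 0, 12, 20, 24, 40
even_case = median([24, 0, 12, 40, 20, 12])   # sorted: 0, 12, 12, 20, 24, 40
print(odd_case, even_case)
```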
Key advantage: The median is resistant to outliers and skewed data.
Imagine there are 5 people sitting in a bar:
Imagine that Elon Musk walks into the bar:
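A quick sketch of the bar example, with invented numbers: the five incomes and the newcomer's income below are hypothetical placeholders chosen only to show the effect, not real figures.

```python
import statistics

# Hypothetical annual incomes (in dollars) of the five people in the bar.
bar = [30_000, 40_000, 50_000, 60_000, 70_000]

mean_before = statistics.mean(bar)
median_before = statistics.median(bar)

# A single extreme value joins the group (placeholder figure, not real data).
newcomer_income = 10_000_000_000
bar_after = bar + [newcomer_income]

mean_after = statistics.mean(bar_after)
median_after = statistics.median(bar_after)

print("before: mean =", mean_before, " median =", median_before)
print("after:  mean =", round(mean_after), " median =", median_after)
```

The mean jumps by several orders of magnitude while the median barely moves, which is what "resistant to outliers" means in practice.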
Variance is the average squared distance between an observation and the mean:
\[ \mathrm{var}(Y) = s^2 = \frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{n} \]
\[ \mathrm{sd}(Y) = s = \sqrt{\frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{n}} \]
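The two formulas can be computed by hand in a few lines. This sketch uses made-up data and divides by n, as in the formulas above. Note that many software defaults divide by n − 1 instead (for example `statistics.variance` in Python), so results can differ slightly; `statistics.pvariance` is the standard-library function that divides by n.

```python
import statistics
from math import sqrt

# Hypothetical responses: weeks (0-40) chosen by six made-up respondents.
weeks = [0, 12, 12, 20, 24, 40]
n = len(weeks)
y_bar = sum(weeks) / n

# Variance: average squared deviation from the mean (denominator n).
var_y = sum((y - y_bar) ** 2 for y in weeks) / n

# Standard deviation: square root of the variance, in the original units.
sd_y = sqrt(var_y)

print("variance:", var_y)
print("standard deviation:", sd_y)
print("matches statistics.pvariance:",
      abs(var_y - statistics.pvariance(weeks)) < 1e-9)
print("n - 1 version for comparison:", statistics.variance(weeks))
```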
Assuming a normal distribution:
\[ \mathrm{cov}(X, Y) = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{n} \]
cov(Age, Weeks of Abortion) = -18.4
\[ \mathrm{cor}(X,Y) = \frac{\mathrm{cov}(X,Y)}{\mathrm{sd}(X)\,\mathrm{sd}(Y)} \]
cor(Age, Weeks of Abortion) = -0.086
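To see why correlation is easier to interpret than covariance, the sketch below computes both on made-up data (the ages and responses are hypothetical, not the survey values reported above) and then re-expresses age in months. All helper names are ours, and every formula divides by n as in the definitions above.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def cov(x, y):
    """Average product of deviations from the two means (denominator n)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)


def sd(values):
    """Standard deviation: the square root of cov(values, values)."""
    return sqrt(cov(values, values))


def cor(x, y):
    """Covariance rescaled by both standard deviations; lies in [-1, 1]."""
    return cov(x, y) / (sd(x) * sd(y))


# Hypothetical data for six made-up respondents.
age_years = [20, 30, 40, 50, 60, 70]
weeks = [24, 20, 12, 20, 12, 8]
age_months = [12 * a for a in age_years]

cov_years = cov(age_years, weeks)
cov_months = cov(age_months, weeks)
cor_years = cor(age_years, weeks)
cor_months = cor(age_months, weeks)

print("cov (age in years): ", cov_years)
print("cov (age in months):", cov_months)
print("cor (age in years): ", round(cor_years, 3))
print("cor (age in months):", round(cor_months, 3))
```

Changing the units of age multiplies the covariance by 12 but leaves the correlation unchanged, because the correlation divides out the scale of both variables.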
Histograms allow us to visualize the distribution of a single continuous variable.
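Under the hood, a histogram is just a count of observations per bin. A minimal sketch with made-up responses and four equal-width bins on the 0–40 scale; the function name `histogram_counts` and the text rendering are our own illustration, not a plotting library.

```python
def histogram_counts(values, low, high, n_bins):
    """Count observations in n_bins equal-width bins covering [low, high].
    The top edge (high) is placed in the last bin."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if not low <= v <= high:
            raise ValueError(f"value {v} is outside [{low}, {high}]")
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical responses: weeks (0-40) chosen by ten made-up respondents.
weeks = [0, 6, 12, 12, 15, 20, 24, 24, 30, 40]
counts = histogram_counts(weeks, low=0, high=40, n_bins=4)

# Text rendering: one row per bin, one '#' per observation.
for i, count in enumerate(counts):
    left, right = i * 10, (i + 1) * 10
    print(f"{left:>2}-{right:<2} | {'#' * count}")
```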
Box plots help us compare the distributions of a continuous variable across categories.
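A box plot draws a five-number summary (minimum, first quartile, median, third quartile, maximum) for each category. The sketch below computes those numbers for two made-up groups. Quartile conventions differ across software; this uses the "inclusive" method of `statistics.quantiles`, which is one common choice and an assumption here.

```python
import statistics


def five_number_summary(values):
    """Minimum, quartiles, and maximum: the numbers a box plot displays."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2,
            "q3": q3, "max": max(values)}


# Hypothetical responses (weeks, 0-40) for two made-up groups.
weeks_by_group = {
    "Group A": [0, 12, 20, 24, 40],
    "Group B": [6, 12, 12, 20],
}

summaries = {group: five_number_summary(values)
             for group, values in weeks_by_group.items()}

for group, summary in summaries.items():
    print(group, summary)
```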
We use scatter plots to visualize the relationship between two continuous variables: